Conversation
@tharittk tharittk commented Aug 28, 2025

Objective

This PR has two main goals:

  • move all MPI/NCCL communication calls from within DistributedArray and the various linear operators into a single common place, namely the DistributedMixIn class, whose methods are used by both DistributedArray and the linear operators
  • implement support for mpi4py buffered communications, used instead of object communications in the NumPy+MPI scenario (better performance, and supported with any version of MPI/mpi4py) and in the CuPy+CUDA-aware MPI scenario (whilst still falling back to object communications in the CuPy+non-CUDA-aware MPI scenario). Note that, to allow users to force object communications when dealing with CuPy, a new environment variable PYLOPS_MPI_CUDA_AWARE is introduced (it defaults to 1 but can be set to 0 to force object communications)
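As a minimal sketch of how such an environment variable could gate the choice between buffered and object communications (hypothetical helper names; the actual logic lives inside pylops-mpi):

```python
import os


def cuda_aware_mpi_enabled() -> bool:
    # PYLOPS_MPI_CUDA_AWARE defaults to 1; setting it to 0 forces object comms
    return int(os.environ.get("PYLOPS_MPI_CUDA_AWARE", "1")) == 1


def use_buffered_comms(engine: str) -> bool:
    # NumPy arrays can always use buffered (uppercase) mpi4py calls;
    # CuPy arrays can do so only when the MPI installation is CUDA-aware
    if engine == "numpy":
        return True
    return cuda_aware_mpi_enabled()
```

With the default environment, `use_buffered_comms("numpy")` is always true, while the CuPy path depends on the flag.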

CUDA-Aware MPI

In order to have a CUDA-aware mpi4py installation, mpi4py must be built against a CUDA-aware MPI installation. Since conda installations of mpi4py do not ship with a CUDA-aware MPI, it is therefore required to use pip to install mpi4py. In the case of NCSA Delta, I create a new conda environment and do
module load openmpi/5.0.5+cuda
then
MPICC=/path/to/mpicc pip install --no-cache-dir --force-reinstall mpi4py
(where --force-reinstall is only needed because mpi4py is already installed as part of the conda environment creation).

And to run the test (assuming you're in the compute node already):

module load openmpi/5.0.5+cuda
export PYLOPS_MPI_CUDA_AWARE=1
echo "TESTING **WITH** CUDA_AWARE"

echo "TEST NUMPY MPI"
export TEST_CUPY_PYLOPS=0
mpirun -n 2 pytest tests/ --with-mpi

echo "TEST CUPY MPI"
export TEST_CUPY_PYLOPS=1
mpirun -n 2 pytest tests/ --with-mpi

echo "TEST NCCL"
mpirun -n 2 pytest tests_nccl/ --with-mpi

To Do

  • So far the mpi_allgather method uses the _prepare_allgather_inputs method to prepare inputs such that they are all of the same size (via zero-padding). Whilst this is strictly needed for NCCL, we should instead consider leveraging MPI `Allgatherv` to avoid the extra padding and cutting of arrays - Use in AllGatherv in mpi_allgather #169
  • Modify the building process in the Makefile and environment/requirement files: I suggest adding some targets for CUDA-aware MPI where we put mpi4py in the pip section of the environment file and ask users to set MPICC upfront (this can be documented in the installation.rst section)
  • Modify MatrixMult to remove any direct call to mpi4py communication methods - Use new unified communication methods in MatrixMult #170
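To illustrate the padding/cutting overhead mentioned in the first To Do item, here is a rough NumPy-only sketch (hypothetical helper names; the actual helper is _prepare_allgather_inputs) of how zero-padding equalizes unequal local chunks for a fixed-size allgather, after which each rank's padding must be cut away again. Allgatherv would return the concatenated result directly, with no padding step:

```python
import numpy as np


def pad_to_max(arrays):
    """Zero-pad each local array to the largest local size (NCCL-style allgather prep)."""
    sizes = [a.size for a in arrays]
    nmax = max(sizes)
    padded = [np.pad(a, (0, nmax - a.size)) for a in arrays]
    return padded, sizes


def unpad_concat(padded, sizes):
    """Concatenate gathered chunks, dropping the zero padding of each rank."""
    return np.concatenate([p[:n] for p, n in zip(padded, sizes)])


# simulate two ranks holding local arrays of unequal size
local = [np.arange(3), np.arange(5)]
padded, sizes = pad_to_max(local)          # every chunk is now length 5
gathered = unpad_concat(padded, sizes)     # what Allgatherv would give directly
```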

tharittk and others added 5 commits August 17, 2025 04:07
A new DistributedMixIn class is created with the aim of simplifying and unifying
all comm. calls in both DistributedArray and operators (further hiding
away all implementation details).
@mrava87
Contributor

mrava87 commented Sep 7, 2025

@tharittk great start!

Regarding the setup, I completely agree with the need to change the installation process for CUDA-aware MPI. I have personally so far mostly relied on conda to install MPI as part of the installation of mpi4py, but it seems like this cannot be done to get CUDA-aware MPI (see https://chatgpt.com/share/68bdf141-0658-800d-9c6c-e85aa4ab6d87); so whilst the module load ... part may change (one may be as lucky as you were and find a pre-installed MPI with CUDA support, or may need to install it themselves), the second part should be universal, so we may want to add some Makefile targets for this setup 😄

Regarding the code, as I briefly mentioned offline, whilst I think this is the right way to go:

  • buffer comms for NumPy
  • have the PYLOPS_MPI_CUDA_AWARE env variable for CuPy to allow using object comms for non CUDA-Aware MPI + CuPy

I am starting to feel that the number of branches in the code is growing and it is about time to put it all in one place... what I am mostly concerned about is that these kinds of branches will not only be present in DistributedArray but will start to permeate into the operators. I had a first go at it, only with the allgather method, to give you an idea and discuss together whether you think this is a good approach before we implement it for all the other comm methods. The approach I took is two-fold:

  • create a _mpi subpackage (similar to _nccl) where all MPI methods are implemented with the various branches - what you had so far in the else branch of the _allreduce method in DistributedArray
  • create a mixin class DistributedMixIn (in the Distributed file) where we can basically move all comm methods that are currently in DistributedArray. However, by doing so, operators can also inherit this class and access those methods - I used VStack as an example.

@astroC86 we have also talked a bit about this in the context of your MatrixMult operator. Pinging you so you can follow this space; hopefully, once this PR is merged, the bar for the implementation of operators that support all backends (NumPy+MPI, CuPy+MPI, CuPy+NCCL) will be lowered, as one would just need to know what communication pattern they want to use and call the one from the mixin class without worrying about the subtleties of the different backends.
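The mixin idea above can be sketched in a few lines (stand-in class and method bodies; the real DistributedMixIn dispatches between MPI/NCCL and buffered/object comms internally): both the distributed array and an operator inherit the same communication method, so backend branches live in exactly one place.

```python
class DistributedMixIn:
    """Shared communication methods (sketch; real code branches on backend)."""

    def _allreduce(self, base_comm, value):
        # placeholder for the backend-dispatching allreduce in pylops-mpi
        return base_comm.allreduce(value)


class FakeComm:
    """Stand-in for an MPI communicator, for illustration only."""

    def allreduce(self, value):
        return value * 2  # pretend two ranks each contribute `value`


class DistributedArray(DistributedMixIn):
    def __init__(self, comm, local):
        self.comm, self.local = comm, local

    def sum_all(self):
        return self._allreduce(self.comm, self.local)


class VStackOperator(DistributedMixIn):
    """Operators inherit the very same comm methods, no duplication."""

    def __init__(self, comm):
        self.comm = comm

    def combine(self, partial):
        return self._allreduce(self.comm, partial)
```

Any new operator then only picks the communication pattern it needs (allreduce, allgather, ...) and calls the mixin method, without touching backend-specific code.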

@mrava87 mrava87 mentioned this pull request Sep 9, 2025
@mrava87 mrava87 changed the title Buffered communication for CUDA-Aware MPI Feat: restructuring of communication methods (and buffered communication for CUDA-Aware MPI) Sep 23, 2025
@mrava87
Contributor

mrava87 commented Sep 23, 2025

@tharittk I worked a bit more on this, but there is still quite a bit to do (added to the TODO list in the PR main comment)...

Also I am not really sure why some tests are failing on some specific combinations of python/mpi/nranks but not on others... I have not investigated yet...

@tharittk tharittk marked this pull request as ready for review October 9, 2025 08:14
Contributor

@mrava87 mrava87 left a comment


@tharittk I left a few comments, Distributed is partially unfinished as we need to make sure all methods get passed the same inputs and don't rely on self.* so that they can be used by both DistributedArray and operators.

Also, running tests locally, I pass NumPy+MPI and CuPy+MPI, but for CuPy+NCCL I get a seg fault at tests_nccl/test_solver_nccl.py::test_cgls_broadcastdata_nccl[par0] (Fatal Python error: Segmentation fault). Same for you?

@mrava87
Contributor

mrava87 commented Oct 19, 2025

@rohanbabbar04 I remember we discussed this a long time ago and you were actually the first to suggest using mixins... feel free to take a look and provide any feedback 😄

@rohanbabbar04
Collaborator

rohanbabbar04 commented Oct 20, 2025

Thanks @tharittk and @mrava87
I will take a look into this tomorrow. 🙂

@tharittk
Collaborator Author

@tharittk I left a few comments, Distributed is partially unfinished as we need to make sure all methods get passed the same inputs and don't rely on self.* so that they can be used by both DistributedArray and operators.

Also, running tests locally, I pass NumPy+MPI and CuPy+MPI, but for CuPy+NCCL I get a seg fault at tests_nccl/test_solver_nccl.py::test_cgls_broadcastdata_nccl[par0] (Fatal Python error: Segmentation fault). Same for you?

I don't have the problem with CuPy + NCCL - I still got 309 tests passed.

This is my sequence of commands:
$ conda activate cuda-mpi # env that was built with cuda-aware mpi
$ module load openmpi/5.0.5+cuda # NCSA module load
$ export TEST_CUPY_PYLOPS=1
$ export PYLOPS_MPI_CUDA_AWARE=1
$ mpirun -n 2 pytest tests_nccl/ --with-mpi

I switched to mpiexec and it still works fine:
$ mpiexec -n 2 pytest tests_nccl/ --with-mpi

@mrava87
Contributor

mrava87 commented Oct 20, 2025

mpiexec -n 2 pytest tests_nccl/ --with-mpi

MMh interesting... I installed NCCL in my newer OpenMPI env and I also don't get that error anymore... but I get a new one due to https://github.com/tharittk/pylops-mpi/blob/a317a884efc556419eac0b5652b67207edb3eb97/tests_nccl/test_ncclutils_nccl.py#L12... surely you must get it too, as those methods have been moved to _common?

I fixed that 😄

So I can now run locally the following with success:

make tests
make tests_nccl
export PYLOPS_MPI_CUDA_AWARE=0; make tests_gpu

It seems that the installation I thought had CUDA-aware MPI did not, and things worked because I had PYLOPS_MPI_CUDA_AWARE=0 set... since you have CUDA-aware MPI, can you please test the entire suite?

make tests
make tests_nccl
export PYLOPS_MPI_CUDA_AWARE=0; make tests_gpu
export PYLOPS_MPI_CUDA_AWARE=1; make tests_gpu

Apart from this (which we definitely need to try to put on a CI - it is just too much to run locally now...), once you can handle my code comments above we should be almost ready to merge 🚀

Collaborator

@rohanbabbar04 rohanbabbar04 left a comment


The changes look good to me :)
I am also going to test it once the changes are done on a multi gpu system on chameleon cloud.

Also, I agree we should set up CI for this. Maybe I can take a look into it.

@mrava87
Contributor

mrava87 commented Oct 24, 2025

Thanks @rohanbabbar04!

For the CI, it is not so easy as GitHub does not offer GPU machines. In PyLops, @hongyx11 set it up using his university cluster as a self-hosted runner (https://github.com/PyLops/pylops/blob/dev/.github/workflows/buildcupy.yaml). He has been looking to extend this to a multi-GPU setting for pylops-mpi but I am not sure how far he has gone 😄

@mrava87
Contributor

mrava87 commented Oct 25, 2025

@tharittk I have finalized removing all self from the methods in Distributed and passing those variables as input instead. I have also cleaned things a bit, added docstrings and type hints, so the new additions are pretty much ready to go for me.

But I am still facing an issue with the CUDA-aware MPI tests (I finally got them to run!): the last 3 tests in test_stacked_math seem to fail... they all use complex-valued numbers, so maybe this is the problem, but I wasn't able to really pin-point where this problem could come from in our new CUDA-aware MPI code (I do not see the same tests failing with any other combination of backends)... Can you please check if they also fail for you and have a look to see if you understand why?

@tharittk
Collaborator Author

tharittk commented Oct 26, 2025

@mrava87 I assume you mean uncommenting this line and test

# assert_allclose(linfnorm, np.linalg.norm(stacked_array1.asarray().flatten(), np.inf),

I did
mpiexec -n 2 pytest tests/test_stackedarray.py --with-mpi -s
and checked via the print that it executes this portion and takes the true branch:

   if deps.cuda_aware_mpi_enabled or engine == "numpy":

I have 12 tests pass and nothing fails.

@mrava87
Contributor

mrava87 commented Oct 27, 2025

@mrava87 I assume you mean uncommenting this line and test

# assert_allclose(linfnorm, np.linalg.norm(stacked_array1.asarray().flatten(), np.inf),

I did mpiexec -n 2 pytest tests/test_stackedarray.py --with-mpi -s and checked via the print that it executes this portion and takes the true branch

   if deps.cuda_aware_mpi_enabled or engine == "numpy":

I have 12 tests pass and nothing fails.

@tharittk No, I had not even noticed that 😉

What I meant was that I could only get the tests to pass if i did:

@pytest.mark.mpi(min_size=2)
@pytest.mark.parametrize("par", [(par1), (par1j), (par2),])
                                # (par2j), (par3), (par3j)])
def test_stacked_math(par):

basically comment out the 3 par*j.

Interesting that everything works fine for you. Btw, what is the -s at the end of your command to run tests?

Not sure why there are inconsistencies between versions of MPI, but I am not 100% surprised. Since I do not have a pre-installed CUDA-aware MPI, my installation process to get things working (except for the tests with those 3 parameters) was:

conda create -n pylops_mpi_openmpi_gpu python=3.11
conda activate pylops_mpi_openmpi_gpu
conda install -c conda-forge cuda openmpi
conda install -c conda-forge numpy scipy mpi4py cupy pytest pytest-mpi
pip install -e .

export OMPI_MCA_opal_cuda_support=true
export PYLOPS_MPI_CUDA_AWARE=1
export TEST_CUPY_PYLOPS=1
mpiexec -n 2 pytest tests/ --with-mpi

@mrava87
Contributor

mrava87 commented Oct 27, 2025

@tharittk, so if you confirm all tests run for you, and you are also comfortable with the current version of the code, I am happy for this to go ahead and be merged... I can investigate on my end why I have this issue, but it should not be a blocker 😄

@tharittk
Collaborator Author

The -s is to show the stdout: pytest suppresses stdout by default. I put the print() in to make sure that it was the code in the CuPy + buffered MPI branch that got executed during the test.

For this,

conda create -n pylops_mpi_openmpi_gpu python=3.11

...

export OMPI_MCA_opal_cuda_support=true
..

I think it is quite different from what I did to make it work on the NCSA cluster.
I need to do something like

MPICC=/path/to/mpicc pip install --no-cache-dir --force-reinstall mpi4py

which kind of forces the installation of mpi4py against a particular MPI in the system.

I'm pretty comfortable with the current code. In the meantime, I want to go check out DeltaAI and see if it offers CUDA-aware MPI. If so, I want to try installing pylops-mpi on that system and run the tests to see if everything looks ok.

@mrava87
Contributor

mrava87 commented Oct 27, 2025

Alright, makes sense (about the -s).

And yes, I know your current setup is quite different; I just wanted to mention what I did to get CUDA-aware MPI to work for me for the sake of the record… since I don't have a pre-installed MPI I can rely on, I decided (with the help of ChatGPT) to go the conda way, so I don't need to rely on any system installation of CUDA either… it makes sense that you need to do a re-install with pip, pointing to MPICC, as you had already installed mpi4py with conda; for me, since I install MPI directly in my environment, it is picked up automatically when installing mpi4py 😀

Happy to either wait until you test it on Delta or merge if you prefer (and we can always handle any issue that arises, if it arises, in a separate PR)?

@tharittk
Collaborator Author

Yes, sure, we can handle any issue that may arise separately.

I have been trying to get onto a DeltaAI compute node for some time now (they only have GH200). It is a challenge to get an allocation, in particular because I need an interactive shell to test the installation.

@mrava87 mrava87 merged commit 5362ed2 into PyLops:main Oct 28, 2025
61 checks passed